{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "207942ee-1cad-4ff4-a64a-90aebc10d040",
   "metadata": {},
   "source": [
    "# Getting Started With Text Embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed132594-2b03-449b-87e2-2529d60aa05a",
   "metadata": {},
   "source": [
    "#### Project environment setup\n",
    "\n",
    "- Load credentials and relevant Python Libraries\n",
    "- If you were running this notebook locally, you would first install Vertex AI.  In this classroom, this is already installed.\n",
    "```Python\n",
    "!pip install google-cloud-aiplatform\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2931ae6b-23c1-4d02-a724-1bc0b1b9a820",
   "metadata": {
    "height": 63
   },
   "outputs": [],
   "source": [
    "from utils import authenticate\n",
    "credentials, PROJECT_ID = authenticate() # Get credentials and project ID"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f3cfcd44-8eba-4eee-b7b4-a7be3c876685",
   "metadata": {},
   "source": [
    "#### Enter project details"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb9c9a86-f9bd-45a6-93c2-f6dbf7d32b12",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "print(PROJECT_ID)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2d411edf-7fe6-4160-9ecb-6f5e05161951",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "REGION = 'us-central1'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f9223fa5-1cd8-42a9-a967-9f2c848727ee",
   "metadata": {
    "height": 114
   },
   "outputs": [],
   "source": [
    "# Import and initialize the Vertex AI Python SDK\n",
    "\n",
    "import vertexai\n",
    "vertexai.init(project = PROJECT_ID, \n",
    "              location = REGION, \n",
    "              credentials = credentials)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0e448c9f-8311-420b-b408-a1fb712bbff4",
   "metadata": {},
   "source": [
    "#### Use the embeddings model\n",
    "- Import and load the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "681bfe86-6880-4afa-bdd6-77a8dfbc5f21",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "from vertexai.language_models import TextEmbeddingModel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fa62b23a-b423-4651-88f1-b00509900be3",
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "embedding_model = TextEmbeddingModel.from_pretrained(\n",
    "    \"textembedding-gecko@001\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ce2289d4-882f-4336-a720-75e80cac6900",
   "metadata": {},
   "source": [
    "- Generate a word embedding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f05aea17-cb72-4c7f-9da2-df0ee954e663",
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "embedding = embedding_model.get_embeddings(\n",
    "    [\"life\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "685a6bc7-455a-4783-93d5-fa491e82e828",
   "metadata": {},
   "source": [
    "- The returned object is a list with a single `TextEmbedding` object.\n",
    "- The `TextEmbedding.values` field stores the embeddings in a Python list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ea301db6-a439-4c0b-aae1-5e6ef943419c",
   "metadata": {
    "height": 63
   },
   "outputs": [],
   "source": [
    "vector = embedding[0].values\n",
    "print(f\"Length = {len(vector)}\")\n",
    "print(vector[:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b803ae9-dd2e-421e-bc06-d6e7e0898b06",
   "metadata": {},
   "source": [
    "- Generate a sentence embedding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f53a4cd7-f89a-4931-ae16-519d1feb5197",
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "embedding = embedding_model.get_embeddings(\n",
    "    [\"What is the meaning of life?\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7ca0182e-1744-4fbb-b240-2d193ec907d4",
   "metadata": {
    "height": 63
   },
   "outputs": [],
   "source": [
    "vector = embedding[0].values\n",
    "print(f\"Length = {len(vector)}\")\n",
    "print(vector[:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7179f3de-e405-4c97-aaf5-055a99c464b6",
   "metadata": {},
   "source": [
    "#### Similarity\n",
    "\n",
    "- Calculate the similarity between two sentences as a number between 0 and 1.\n",
    "- Try out your own sentences and check if the similarity calculations match your intuition."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "91b2520b-9687-4eed-be07-9ff75f1fcd74",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics.pairwise import cosine_similarity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "04483bf2-4d5d-4fd2-9aec-27711f2c780c",
   "metadata": {
    "height": 216
   },
   "outputs": [],
   "source": [
    "emb_1 = embedding_model.get_embeddings(\n",
    "    [\"What is the meaning of life?\"]) # 42!\n",
    "\n",
    "emb_2 = embedding_model.get_embeddings(\n",
    "    [\"How does one spend their time well on Earth?\"])\n",
    "\n",
    "emb_3 = embedding_model.get_embeddings(\n",
    "    [\"Would you like a salad?\"])\n",
    "\n",
    "vec_1 = [emb_1[0].values]\n",
    "vec_2 = [emb_2[0].values]\n",
    "vec_3 = [emb_3[0].values]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c0e70a55-21a4-4532-be59-1b9749d639b0",
   "metadata": {},
   "source": [
    "- Note: the reason we wrap the embeddings (a Python list) in another list is because the `cosine_similarity` function expects either a 2D numpy array or a list of lists.\n",
    "```Python\n",
    "vec_1 = [emb_1[0].values]\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9b370eaa-a3b7-44e4-90a2-6dc84e460122",
   "metadata": {
    "height": 63
   },
   "outputs": [],
   "source": [
    "print(cosine_similarity(vec_1,vec_2)) \n",
    "print(cosine_similarity(vec_2,vec_3))\n",
    "print(cosine_similarity(vec_1,vec_3))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe4b69f0-e64b-477d-9466-72ba1be05043",
   "metadata": {},
   "source": [
    "#### From word to sentence embeddings\n",
    "- One possible way to calculate sentence embeddings from word embeddings is to take the average of the word embeddings.\n",
    "- This ignores word order and context, so two sentences with different meanings, but the same set of words will end up with the same sentence embedding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b88fff09-a268-4a29-a04b-7c84a588db80",
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "in_1 = \"The kids play in the park.\"\n",
    "in_2 = \"The play was for kids in the park.\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40437875-53db-4c88-8882-a2d6266d43d3",
   "metadata": {},
   "source": [
    "- Remove stop words like [\"the\", \"in\", \"for\", \"an\", \"is\"] and punctuation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3bab3a74-ec19-48cc-b5fb-1f0bd520b2c2",
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "in_pp_1 = [\"kids\", \"play\", \"park\"]\n",
    "in_pp_2 = [\"play\", \"kids\", \"park\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2202a83-a704-41aa-ae2a-ba9ee2afd351",
   "metadata": {},
   "source": [
    "- Generate one embedding for each word.  So this is a list of three lists."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "412df546-e2e2-4a90-9cd4-86b4887ab908",
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "embeddings_1 = [emb.values for emb in embedding_model.get_embeddings(in_pp_1)]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a3ac14b1-3d55-4227-9014-88447d5a08b8",
   "metadata": {},
   "source": [
    "- Use numpy to convert this list of lists into a 2D array of 3 rows and 768 columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e5bcf45a-fb96-4838-a676-fbf52afc0b32",
   "metadata": {
    "height": 63
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "emb_array_1 = np.stack(embeddings_1)\n",
    "print(emb_array_1.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a8d20a4e-c56d-4a23-83ff-97d98e86cf32",
   "metadata": {
    "height": 80
   },
   "outputs": [],
   "source": [
    "embeddings_2 = [emb.values for emb in embedding_model.get_embeddings(in_pp_2)]\n",
    "emb_array_2 = np.stack(embeddings_2)\n",
    "print(emb_array_2.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3a83a56b-404d-4517-a666-c161f92ae48f",
   "metadata": {},
   "source": [
    "- Take the average embedding across the 3 word embeddings \n",
    "- You'll get a single embedding of length 768."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5d6d9b65-a0ce-49eb-9627-b550053663c3",
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "emb_1_mean = emb_array_1.mean(axis = 0) \n",
    "print(emb_1_mean.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "23ba79b5-ba81-4c7e-b1d8-642b428e16ec",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "emb_2_mean = emb_array_2.mean(axis = 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fdfe77e1-0b0c-407e-b179-862db2e2e454",
   "metadata": {},
   "source": [
    "- Check to see that taking an average of word embeddings results in two sentence embeddings that are identical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "22dd51c8-cee8-46f1-b717-0a44ff8f1d1c",
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "print(emb_1_mean[:4])\n",
    "print(emb_2_mean[:4])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a8014b28-ddfa-43e0-969b-5bd4ed0f6fdb",
   "metadata": {},
   "source": [
    "#### Get sentence embeddings from the model.\n",
    "- These sentence embeddings account for word order and context.\n",
    "- Verify that the sentence embeddings are not the same."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f69ff3c-dd02-45ae-b34c-a10f3d78614b",
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "print(in_1)\n",
    "print(in_2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "08dcfe77-013f-4cdc-a6c5-455c364b2507",
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "embedding_1 = embedding_model.get_embeddings([in_1])\n",
    "embedding_2 = embedding_model.get_embeddings([in_2])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dd4e998f-01b4-4d90-b509-cb03f26c106c",
   "metadata": {
    "height": 80
   },
   "outputs": [],
   "source": [
    "vector_1 = embedding_1[0].values\n",
    "print(vector_1[:4])\n",
    "vector_2 = embedding_2[0].values\n",
    "print(vector_2[:4])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4ccfbb6f-c72b-4f5e-a86c-265bab07d9e1",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
